Deep learning approach to femoral AVN detection in digital radiography: differentiating patients and pre-collapse stages

Objective This study aimed to evaluate a new deep-learning model for diagnosing avascular necrosis of the femoral head (AVNFH) from pelvic anteroposterior digital radiographs. Methods The study sample included 1167 hips. The radiographs were independently classified into 6 stages by a radiologist using the corresponding MRIs. The radiographs were then used to train and test the deep learning models of the project, which included an SVM layer and an ANFIS layer, using the Python programming language and the TensorFlow library. In the last step, the test set of hip radiographs was given to two independent radiologists with different levels of work experience to compare their diagnostic performance with that of the deep learning models using the F1 score and McNemar test analysis. Results The performance of SVM for AVNFH detection (AUC = 82.88%) was slightly higher than that of the less experienced radiologist (79.68%) and slightly lower than that of the experienced radiologist (88.4%), without reaching significance (p-value > 0.05). For pre-collapse AVNFH detection, SVM reached an AUC of 73.58%, significantly higher than that of the less experienced radiologist (AUC = 60.70%, p-value < 0.001). On the other hand, no significant difference was noted between the experienced radiologist and SVM for pre-collapse detection. The ANFIS algorithm for AVNFH detection, with an AUC of 86.60%, performed significantly better than the less experienced radiologist (AUC = 79.68%, p-value = 0.04). Its performance was lower than that of the experienced radiologist, but the difference was not statistically significant (AUC = 88.40%, p-value = 0.20). Conclusions Our study has shed light on the capabilities of SVM and ANFIS as diagnostic tools for AVNFH detection in radiography. Their ability to achieve high accuracy efficiently makes them promising candidates for early detection and intervention, ultimately contributing to improved patient outcomes.


Introduction
Avascular necrosis of the femoral head (AVNFH) can occur at almost any age and results from a disturbance of the blood supply to bone tissue. This disturbance may have traumatic (secondary to femoral neck fracture) or nontraumatic (chronic corticosteroid therapy, alcoholism, smoking, SLE, etc.) causes [1][2][3].
Considering the debilitating consequences of late diagnosis for the patient, timely diagnosis and treatment of AVNFH are extremely necessary and lifesaving [4,5]. For this reason, the efforts of the treatment staff are aimed at diagnosing the early stages of the disease, preventing bone collapse, and ultimately preventing the need for hip arthroplasty [6]. Because symptoms in the early stages are few and nonspecific and overlap with those of other conditions such as transient osteoporosis of the hip, reflex sympathetic dystrophy, and subchondral stress response, it seems reasonable and valuable to use a fast, cheap, and reliable method to diagnose the disease [4,7,8].
Hip radiography, often performed as an anteroposterior (AP) view, is the first imaging method for screening patients with hip pain because of its low cost and accessibility. For this reason, AP radiography of the pelvis, the simplest, cheapest, and most accessible diagnostic method, remains important worldwide as a diagnostic and primary screening method for most traumatic and non-traumatic musculoskeletal problems, including AVNFH [7,9].
According to the Ficat classification, which is used in this study, AVNFH is divided into five stages of increasing severity, from stage 0 (normal imaging) to stage 4 (end stage of the disease) [10]. Evaluation of these stages by physicians, especially radiologists and orthopedists, requires years of education and training. Moreover, given that stage 1 of the disease goes undiagnosed with any modality, the challenge is always the accurate diagnosis of stage 2 and higher stages of the disease with the unaided eye [11,12]. In addition to the extensive experience needed for diagnosis, especially for stage 2 in the Ficat classification and for differentiating ARCO stages 2 and 3a [13], the time and attention required and the human errors that occur even in diagnosing stages 3 and 4, due to fatigue and high workload, are reasons to develop deep learning algorithms for AVNFH diagnosis [14,15].
With the tremendous development of Deep Learning (DL) algorithms, Deep Convolutional Neural Networks (DCNN) have shown acceptable capabilities in disease diagnosis. With these methods, goals such as more accurate, faster, and less costly diagnosis of diseases have become more accessible and achievable. Artificial intelligence (AI) is the ability of a machine to perform tasks in a way that resembles human intelligence [16]. DL and DCNN are subsets of AI that use a multilayered structure to evaluate multiple data [17]. To date, some DCNN algorithms have helped medical doctors in different fields of orthopedics and radiology [18][19][20]. Unfortunately, only three studies to date have attempted to aid in the diagnosis of AVNFH using deep learning, possibly due to the nature of AVNFH itself and its diagnostic challenges for deep learning algorithms [21][22][23].
Our study aims to address the limitations of AVN detection in digital radiographic images through a deep learning approach and to compare the performance of deep learning with that of physicians.

Study population
Our study was a retrospective study conducted in three centers under the supervision of Baqiyatallah University of Medical Sciences (BMSU). The ethics committee of BMSU approved the study design. All pelvic digital radiographs and MRIs used in this study were extracted from the Baqiyatallah Hospital database and covered patients seen between 2010 and 2020.
Patients 18 years old or older who had achieved full skeletal maturity were included in this study [24]. All patients included in this study had pelvic AP digital radiography and pelvic MRI. Based on the pelvic MRI findings, we divided the patients into two main groups: patients with normal MRI findings, who were referred for pelvic pain or for reasons other than pelvic pain, and patients with positive findings of AVNFH on pelvic MRI.
After applying the exclusion criteria, 712 hips were included in the control group (59 hips were excluded because of a long interval, more than 1 month, between hip radiography and hip MRI; 19 hips were excluded because of coexisting bone abnormalities such as a bone tumor, bone fracture, or orthopedic device in the femoral head and neck; 2 hips were excluded because of poor-quality MRI or radiograph images).
Additionally, after applying the exclusion criteria, 455 hips were included in the AVNFH group (37 hips were excluded because of a long interval, more than 1 month, between hip radiography and hip MRI; 31 hips were excluded because of coexisting bone abnormalities such as a bone tumor, bone fracture, or orthopedic device in the femoral head and neck; 6 hips were excluded because of poor-quality MRI or radiograph images).
In our study, we used Adaptive Neuro-Fuzzy Inference System (ANFIS) and Support Vector Machine (SVM) algorithms, each as an additional layer in the DL model, to train and test on the data. This was done by comparing stage 0 with the other stages (non-patient vs. patient) and by comparing stage 0 with stage 2 (non-patient vs. patient with brief findings in digital radiography). Accordingly, two datasets were set up for training (n = 993 hips) and testing (n = 174 hips) to distinguish non-patients from patients using the SVM and ANFIS algorithms. Additionally, two other datasets were set up for SVM for training (n = 803 hips) and testing (n = 150 hips) to distinguish stage 0 from stage 2 of the disease (Fig. 1).
Finally, the results obtained on the test datasets with the SVM and ANFIS algorithms were compared with those of radiologists with different levels of experience, from less experienced to experienced.

Image preprocessing
The pelvic images of the patients were cropped into left and right hips, all images were resized to 250 × 250 pixels, and the left hip images were rotated relative to the right hip for better comparison.
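The preprocessing steps above can be illustrated with a minimal sketch. It is not the study's pipeline: the midline crop, the interpretation of "rotated" as a horizontal mirror, the nearest-neighbor resize, and all function names are assumptions made for illustration. Images are plain nested lists so the sketch has no dependencies; a real pipeline would use an array or imaging library.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel intensities

TARGET_SIZE = 250  # side length in pixels, as stated in the text


def split_hips(pelvis: Image) -> Tuple[Image, Image]:
    """Split a pelvic image at the vertical midline (assumed cropping rule)."""
    mid = len(pelvis[0]) // 2
    return [row[:mid] for row in pelvis], [row[mid:] for row in pelvis]


def mirror(img: Image) -> Image:
    """Flip an image horizontally so both hips share one orientation."""
    return [row[::-1] for row in img]


def resize_nearest(img: Image, size: int = TARGET_SIZE) -> Image:
    """Resize to size x size pixels with nearest-neighbor sampling."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def preprocess(pelvis: Image, size: int = TARGET_SIZE) -> Tuple[Image, Image]:
    """Return two hip crops of equal size; the second one is mirrored."""
    first_half, second_half = split_hips(pelvis)
    return resize_nearest(first_half, size), resize_nearest(mirror(second_half), size)


if __name__ == "__main__":
    toy_pelvis = [[c for c in range(6)] for _ in range(4)]  # 4 x 6 toy image
    hip_a, hip_b = preprocess(toy_pelvis)
    print(len(hip_a), len(hip_a[0]), len(hip_b), len(hip_b[0]))
```

Mirroring one side means a single model sees every femoral head in the same orientation, which is the usual motivation for this kind of step.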

Interpretation and classification of images
Two radiologists divided the patients into 5 groups based on the findings of their recent MRI images and on the Ficat classification system.
On this basis, the pelvic radiographs were classified as follows: those with no specific findings for AVNFH on MRI and no problems in follow-up were classified as stage 0; those with only bone marrow edema on MRI who were recognized as having AVNFH in the following years were classified as stage 1; those with geographic lesions in the femoral head on MRI were classified as stage 2; those with a crescent appearance and bone collapse on MRI were classified as stage 3; and finally, those with severe degenerative changes in addition to cortical bone collapse were classified as stage 4 [10].
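The staging scheme and the two binary comparisons described in the study population section (stage 0 vs. other stages, and stage 0 vs. stage 2) can be written down as a small data structure. This is a sketch only; the enum and function names are hypothetical and not taken from the study's code.

```python
from enum import IntEnum
from typing import Optional


class FicatStage(IntEnum):
    """Stages as described in the text (MRI-based reference standard)."""
    STAGE_0 = 0  # no MRI findings of AVNFH, no problems in follow-up
    STAGE_1 = 1  # bone marrow edema only, AVNFH recognized in later years
    STAGE_2 = 2  # geographic lesion in the femoral head
    STAGE_3 = 3  # crescent appearance and bone collapse
    STAGE_4 = 4  # severe degenerative changes plus cortical bone collapse


def label_any_avnfh(stage: FicatStage) -> int:
    """Task 1: non-patient (0) vs. patient (1), i.e. stage 0 vs. other stages."""
    return int(stage != FicatStage.STAGE_0)


def label_stage0_vs_stage2(stage: FicatStage) -> Optional[int]:
    """Task 2: stage 0 (0) vs. stage 2 (1); other stages are left out (None)."""
    if stage == FicatStage.STAGE_0:
        return 0
    if stage == FicatStage.STAGE_2:
        return 1
    return None


if __name__ == "__main__":
    stages = [FicatStage.STAGE_0, FicatStage.STAGE_2, FicatStage.STAGE_4]
    print([label_any_avnfh(s) for s in stages])
    print([label_stage0_vs_stage2(s) for s in stages])
```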
All pelvic radiographs were divided into two separate images of the left and right hip by a radiology expert, and each hip was cropped separately to the dimensions and matrix determined by the study and evaluated in random order by two independent radiologists with different levels of work experience. The presence or absence of AVNFH and the staging of AVNFH were assessed based on the Ficat system. The radiologists who read the images and staged the disease were a second-year radiology resident and a radiologist with 10 years of experience. Neither radiologist had a role in the image preparation process, and the study was completely blinded.

Deep learning algorithms SVM-based deep learning
For SVM-based deep learning, we used a two-layer convolutional architecture. The convolutional kernels use Rectified Linear Unit (ReLU) activation and a kernel size of 3. The convolutional layers are followed by MaxPooling with a pool size of 2 for dimension reduction and extraction of salient features. The network is trained with hinge loss, which promotes "maximum margin" classification and thus embodies the SVM paradigm within deep learning.
To improve the SVM model's adaptability and curb the risk of overfitting, we added L2 kernel regularization. The augmented SVM loss function (Li, reg) combines the hinge loss with a regularization term, in which λ is the regularization parameter and ∥W∥2 is the L2 norm of the weight matrix W. This regularization balances a model that fits the data well against one that avoids undue complexity. The approach aims to combine the strengths of SVM and deep learning, optimizing classification performance while keeping the model robust. The entire architecture was implemented in TensorFlow (Fig. 2).
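For reference, a common form of the L2-regularized hinge loss that is consistent with the symbols defined above is sketched here. The label convention $y_i \in \{-1, +1\}$, the network score $f(x_i)$, and the use of the squared norm are assumptions of this sketch, not details given in the text.

```latex
L_{i,\mathrm{reg}} \;=\; \max\bigl(0,\; 1 - y_i\, f(x_i)\bigr) \;+\; \lambda\, \lVert W \rVert_2^{2}
```

The first term is zero once a sample is classified correctly with a margin of at least 1, and the second term penalizes large weights, with λ setting the trade-off between the two.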

ANFIS-based deep learning
For ANFIS-based deep learning, we used a five-layer convolutional network for dimension reduction. The convolutional kernels use ReLU activation and a kernel size of 3. The convolutional layers are followed by MaxPooling with a pool size of 2 for dimension reduction and feature extraction. This design provides the feature representation on which the subsequent ANFIS model operates.
Fig. 2 ANFIS Model layers used in this study

The ANFIS model is configured with 40 rules. Based on a hybrid learning approach, ANFIS combines the principles of fuzzy logic with the adaptive capabilities of neural networks. For robust training, the Huber loss function (LHuber) is used, which reduces the influence of outliers; in this loss, y is the actual output, f(x) is the predicted output, and δ is the Huber loss parameter. The ANFIS model is optimized with the Adam optimizer, an adaptive learning rate algorithm, chosen for efficient convergence during training (Fig. 3).
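As an illustration of the Huber loss in terms of y, f(x), and δ, the standard definition is quadratic for residuals up to δ and linear beyond it. The sketch below implements that standard definition; the value δ = 1.0 and the function names are placeholders, since the text does not state the δ used.

```python
from typing import Sequence


def huber_loss(y: float, f_x: float, delta: float = 1.0) -> float:
    """Standard Huber loss for one sample.

    0.5 * (y - f(x))^2                   if |y - f(x)| <= delta
    delta * (|y - f(x)| - 0.5 * delta)   otherwise
    """
    residual = abs(y - f_x)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


def mean_huber_loss(ys: Sequence[float], preds: Sequence[float],
                    delta: float = 1.0) -> float:
    """Average Huber loss over a batch of targets and predictions."""
    return sum(huber_loss(y, p, delta) for y, p in zip(ys, preds)) / len(ys)


if __name__ == "__main__":
    print(huber_loss(1.0, 0.5))  # small residual, quadratic branch: 0.125
    print(huber_loss(3.0, 0.0))  # large residual, linear branch: 2.5
```

Because the penalty grows only linearly for large residuals, a single badly mispredicted sample contributes less to the gradient than it would under squared error.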

System architecture and statistical analysis
To train and test the deep learning models, we first divided the patient and non-patient hip data into training and testing sets at a ratio of 6 to 1 and analyzed the data with the SVM and ANFIS models (layer-based DL models). In the remainder of the manuscript, SVM and ANFIS are used as shorthand for the SVM and ANFIS deep learning layer models.
Fig. 3 Flow chart showing study groups for analysis and machine learning

In the next step, given the importance of distinguishing the first and second stages of the disease from the normal hip, the data for these stages (i.e., stages 0 to 2) were split separately at a ratio of 6 to 1 into training and testing sets to check the performance of the SVM model. In the training phase, we performed 10-fold cross-validation (CV) to validate the models' performance on the training and testing sets.
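The splitting scheme can be sketched as a 6-to-1 hold-out split followed by 10-fold cross-validation on the training portion. Stratifying by class, the random seeds, and the helper names are assumptions of this sketch; the text does not describe how hips were assigned, and the sketch does not reproduce the exact set sizes reported (993 and 174).

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_parts: int = 6,
                     test_parts: int = 1, seed: int = 0
                     ) -> Tuple[List[int], List[int]]:
    """Split sample indices into train/test at train_parts:test_parts per class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_parts / (train_parts + test_parts))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def k_fold(indices: Sequence[int], k: int = 10, seed: int = 0
           ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield rest, folds[i]


if __name__ == "__main__":
    labels = [1] * 455 + [0] * 712  # AVNFH and control hips, as in the text
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), len(test_idx))
    for fold_train, fold_val in k_fold(train_idx):
        pass  # fit on fold_train, validate on fold_val
```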
Demographic findings were evaluated with descriptive statistics and presented as numbers and mean ± standard deviation (SD). The non-parametric Mann-Whitney U test was used to compare the ages of patients between the two groups. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and AUC were calculated for each model and each radiologist.
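The threshold-based measures listed above all follow from the four cells of a confusion matrix. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not data from the study.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard diagnostic measures from confusion-matrix counts.

    tp/fn: diseased hips called positive/negative;
    fp/tn: normal hips called positive/negative.
    Assumes every denominator is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the calculation.
    for name, value in diagnostic_metrics(tp=40, fn=10, fp=5, tn=45).items():
        print(f"{name}: {value:.4f}")
```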
The performance of the radiologists and the models was compared using receiver operating characteristic (ROC) curves and the area under the curve (AUC), with 95% confidence intervals calculated by bootstrapping with the pROC package [25]. A single threshold value of 0.5 was used for the ROC curves because the groups were balanced after augmentation. Comparisons between the AUCs of the models and readers were performed with DeLong's test in IBM SPSS Statistics 27.0.1. Significance was defined as a p-value lower than 0.05 in all analyses.
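To make the bootstrapped confidence interval concrete, the sketch below computes a percentile bootstrap interval for the AUC of a classifier evaluated at a single threshold. It is an illustration in Python rather than the pROC workflow used in the study: the plain (unstratified) resampling, the 2000 resamples, the seed, and the toy predictions are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def single_threshold_auc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Trapezoidal AUC of a one-point ROC curve: (sensitivity + specificity) / 2."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(pos) / len(pos)
    specificity = 1 - sum(neg) / len(neg)
    return 0.5 * (sensitivity + specificity)


def bootstrap_ci(y_true: Sequence[int], y_pred: Sequence[int],
                 n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the single-threshold AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    while len(stats) < n_boot:
        sample = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in sample]
        if 0 < sum(truth) < n:  # a resample needs both classes to be usable
            stats.append(single_threshold_auc(truth, [y_pred[i] for i in sample]))
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


if __name__ == "__main__":
    # Toy binary predictions at threshold 0.5 (hypothetical, not study data).
    y_true = [1] * 50 + [0] * 50
    y_pred = [1] * 40 + [0] * 10 + [1] * 5 + [0] * 45
    print(single_threshold_auc(y_true, y_pred), bootstrap_ci(y_true, y_pred))
```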

Patient demographics
After applying the exclusion criteria and removing inappropriate hip digital radiographs, a total of 1167 hips were included in the analysis: 455 hips with AVNFH and 712 hips without AVNFH. The mean age was 44.6 ± 12.5 in the group without AVNFH and 46.7 ± 12.7 in the group with AVNFH, with no statistically significant difference between the ages of the study groups (p-value = 0.437) (Table 1) (Fig. 1). Additionally, no gender difference was observed between the groups.

Radiomics analysis and machine learning model performance
Following data preprocessing and scaling, a dataset of 993 hips (387 AVNFH hips and 606 normal hips) was used for training the DL models. In the training set, the SVM algorithm exhibited an accuracy of 86.21% (80.18-90.96%) over 100 epochs (Fig. 4). The operating point of the SVM algorithm was set to achieve the highest accuracy for detecting AVNFH. At the highest achieved accuracy, the performance of the SVM algorithm was checked on the test set. The sensitivity and specificity of the SVM algorithm in distinguishing AVNFH from normal hips were 67.65% (95% CI 55.21-78.49%) and 98.11% (95% CI 93.35-99.77%), respectively. Additionally, the AUC (95% CI) for the SVM algorithm was 82.88% (74.21-88.13%). This threshold was used throughout subsequent analyses (Table 2).
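As an arithmetic check on these figures: with a single operating point, the empirical ROC curve has one interior vertex, and its trapezoidal area reduces to (sensitivity + specificity) / 2. The sketch below applies this to a confusion matrix inferred from the reported percentages and the test-set size of 174 hips (TP = 46, FN = 22, FP = 2, TN = 104). These counts are an assumption of the sketch and are not stated in the text.

```python
from typing import Dict


def check_reported_values(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Return sensitivity, specificity, accuracy, and one-point AUC in percent."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    auc = 0.5 * (sensitivity + specificity)  # area under the one-point ROC curve
    return {
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
        "accuracy": round(100 * accuracy, 2),
        "auc": round(100 * auc, 2),
    }


if __name__ == "__main__":
    # Inferred counts (assumption): 68 AVNFH hips and 106 normal hips, n = 174.
    print(check_reported_values(tp=46, fn=22, fp=2, tn=104))
    # {'sensitivity': 67.65, 'specificity': 98.11, 'accuracy': 86.21, 'auc': 82.88}
```

Under these inferred counts, the sensitivity, specificity, and AUC match the reported values, and the same counts also yield an accuracy of 86.21%.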

Comparison of machine learning algorithms to radiologists
In this step, the test set that had been set aside for the DL algorithms in the previous step was given to two radiologists with different levels of work experience for review, and each radiologist gave an opinion on the set of hip radiographs. In the subset of AVNFH detection (other stages in comparison to stage 0), the less experienced radiologist achieved a sensitivity, specificity, and [...] (Table 3). The AUCs (95% CI) of the deep learning algorithms and the radiologists were compared using DeLong's test, with a p-value ≤ 0.05 considered statistically significant. The performance of the SVM algorithm for AVNFH detection was slightly higher than that of the less experienced radiologist and slightly lower than that of the experienced radiologist, without reaching significance (p-value > 0.05) (Table 4). For pre-collapse AVNFH detection, SVM reached an AUC of 73.58%, significantly higher than that of the less experienced radiologist (AUC = 60.70%, p-value < 0.001). On the other hand, no significant difference was noted between the experienced radiologist and SVM for pre-collapse detection (Table 4).

Discussion
Radiography is the first diagnostic method in patients with pelvic pain and pelvic trauma. Several studies have applied DL to pelvic radiography. Some have used artificial intelligence to measure the angles and dimensions of the femur for surgical purposes and have achieved favorable results [26][27][28][29].
Other studies have focused on diagnosis and have mainly used DL algorithms to investigate fractures of the femoral head and neck [30][31][32]. Among these, Hsieh and colleagues, with a DL model called DAFDNet, achieved 94.8% accuracy in detecting non-displaced femoral neck fractures, a better performance than older algorithms such as DenseNet and U-Net [30]. In another study, conducted by Liu and colleagues on the diagnosis of femoral intertrochanteric fracture, the DL algorithm Faster-RCNN achieved an accuracy of 88%, higher than the diagnostic accuracy of orthopedists [31]. In one study on AVN, Wernér et al. used a segmentation-based DL model to diagnose lunate AVN. They achieved a sensitivity of 93.33%, specificity of 93.28%, accuracy of 93.28%, and AUC of 0.94 (95% CI 0.88-0.99), which was better than the results of one expert and lower than those of another [33].
Despite the studies of the femoral head and neck mentioned above, only a few studies have investigated AVNFH and its differentiation from normal cases and other causes [21][22][23]. This is the case even though early and timely diagnosis of AVNFH is extremely beneficial to the patient and prevents the consequences of late diagnosis, such as femoral head collapse and the need for surgery [34].
However, it is difficult for a doctor to diagnose or suspect the disease with the unaided eye, especially during the pre-collapse stages. In this situation, doctors take one of two approaches to pelvic pain, conservative or non-conservative. The first approach leads to unnecessary pelvic MRIs in most people, and the second, which is based on dismissing patients' symptoms as psychosomatic, sometimes leads to missing the early stages of the disease [35,36].
In our study, we developed and trained two DL models, SVM and ANFIS, that could predict AVNFH in digital radiography. When deciding between ANFIS, SVM, and ANN for image analysis, the strengths of each model should be considered. ANFIS excels in handling complex, nonlinear relationships and uncertainties through its fuzzy logic, making it suitable for scenarios where data patterns are intricate and difficult to discern. SVM, on the other hand, is effective in dealing with high-dimensional data and can work well with smaller training datasets, making it a good choice for resource-constrained environments. These models also offer interpretability, with ANFIS providing insights through fuzzy rules and SVM offering clear decision boundaries. These strengths make ANFIS and SVM valuable options for analyzing radiology images, especially when computational resources are limited or when interpretability is crucial.
Our study showed that the performance of both DL models (SVM & ANFIS) in detecting AVNFH was superior to that of the less experienced radiologist, without statistical significance. These findings are very similar to those of Li and colleagues, who used the proposed AVN-Net algorithm to detect AVNFH with an F1 score of 0.9242 [21]. Similar findings were also obtained by Chee and colleagues, who diagnosed AVNFH in radiography with a sensitivity and specificity of 75.2% and 97.2%, respectively [22]. In another study, based on MRI AVNFH detection, Klontzas et al. showed that their proposed CNN (AUC of 85.50%) achieved AVNFH detection performance similar to that of two MSK experts (the first expert achieved an AUC of 75.70%, whereas the second achieved an AUC of 73.08%), without a significant difference [23].
Although there is no significant difference in the ability of DL models and radiologists to distinguish patients from non-patients, it should be noted that the increased workload of radiologists leads to a significant loss of diagnostic power [14]. For this reason, the use of deep learning seems reasonable both for reducing diagnosis time and for reducing diagnostic errors [19,21,37].
Considering the importance of differentiating the pre-collapse stage from the absence of disease (stage 2 in comparison to stage 0) and the high rate of human error in differentiating these stages [14,38], our study compared the ability of DL models and radiologists to differentiate these states.
The SVM model, surprisingly, performed significantly better than the radiology resident in distinguishing stage 2 from stage 0 (p-value less than 0.05), which may lead to the use of DL models as auxiliary tools in teaching hospitals in the future. In distinguishing stage 2 from stage 0, the performance of SVM showed no significant difference from that of the experienced radiologist, which suggests that the DL model can be used in areas where there is no access to experienced radiologists.
It should be noted that the ANFIS model did not perform convincingly in differentiating stage 2 from stage 0, and the results were not acceptable.
The superior performance demonstrated by both SVM and ANFIS algorithms indicates their substantial potential as viable diagnostic instruments for facilitating early detection and intervention. These findings, while promising, underscore the need for a more comprehensive understanding of the operational mechanisms that underlie these algorithms. A detailed examination will help enhance their efficacy and reliability, thereby supporting their performance in clinical predictions and decision-making.
Furthermore, this research provides an impetus for additional investigation into the algorithms' wider applications in the medical field. A broader adoption of these algorithms in clinical settings could improve diagnostic procedures, particularly in challenging domains where human expertise is limited or where the speed of diagnosis is critical. Longitudinal studies are recommended to evaluate the performance of these algorithms over extended periods. This would provide insights into their stability, consistency, and adaptability in response to evolving medical data and shifting patient demographics.

Conclusion
The transition from theoretical validation to practical application will require the establishment of rigorous validation protocols and ethical guidelines to ensure the responsible and equitable use of these diagnostic tools. This calls for close collaboration between researchers, clinicians, ethicists, and policymakers, aiming for a holistic integration of these algorithms into the healthcare system while addressing potential challenges and risks.
In conclusion, our study has shed light on the capabilities of SVM and ANFIS as diagnostic tools for AVNFH detection in radiography. Their ability to achieve high accuracy efficiently makes them promising candidates for early detection and intervention, ultimately contributing to improved patient outcomes.

Fig. 1 SVM model layers used in this study

Fig. 4 The behavior observed during the training of the SVM model (100 epochs of SVM training). A: Changes in accuracy during training in both the validation and training sets, comparing stage 0 with other stages (non-patient vs. patient). B: Changes in accuracy during training in both the validation and training sets, comparing stage 0 with stage 2 (non-patient vs. patient with brief findings in digital radiography)

Fig. 5 The behavior observed during the training of the ANFIS model (140 epochs of ANFIS training). Changes in accuracy during training in both the validation and training sets, comparing stage 0 with other stages (non-patient vs. patient)

Table 1
Demographic information of patients
1: Support vector machine, 2: Adaptive Neuro-Fuzzy Inference System, NA: not applicable

Table 2
Diagnostic performance of Deep Learning (DL) algorithms for avascular necrosis of femoral head (AVNFH) in digital radiography

Table 3
Diagnostic performance of radiologists for avascular necrosis of femoral head (AVNFH) in digital radiography

Measure; Less Experienced A (Comparison of stage 0 with stage 2); Experienced B (Comparison of stage 0 with stage 2); Less Experienced A (Comparison of stage 0 with other stages)
*: Area under the curve, **: Positive Predictive Value, ***: Negative Predictive Value

Table 4
Comparison of SVM algorithm to radiologists

Table 5
Comparison of ANFIS algorithm to radiologists
A: second-year radiology resident, B: 10-year experienced radiologist
1: Adaptive Neuro-Fuzzy Inference System
DeLong's test p-values on AUC between ANFIS and radiologists
*: Area under the curve, **: p-value of the comparison of each reader to ANFIS, ***: statistically significant value